{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Agentic RAG workflow on tabular data from a PDF file\n",
    "\n",
    "In this notebook, we're building a workflow to extract accurate tabular data information from a PDF file.\n",
    "\n",
    "The following bullets summarize the notebook, with highlights being:\n",
    "\n",
    "- Parse the PDF file and extract tables into images (optional).\n",
    "- A single RAG agent fails to get the accurate information from tabular data.\n",
    "- An agentic workflow using a groupchat is able to extract information accurately:\n",
    "    - the agentic workflow uses a RAG agent to extract document metadata (e.g. the image of a data table using just the table name)\n",
    "    - the table image is converted to Markdown through a multi-modal agent\n",
    "    - finally, an assistant agent answers the original question with an LLM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "````{=mdx}\n",
    ":::info Requirements\n",
    "Unstructured-IO is a dependency for this notebook to parse the PDF. Please install AG2 (with the neo4j extra) and the dependencies:\n",
    "\n",
    "- Install Poppler https://pdf2image.readthedocs.io/en/latest/installation.html\n",
    "- Install Tesseract https://tesseract-ocr.github.io/tessdoc/Installation.html\n",
    "- `pip install -U ag2[openai,neo4j], unstructured==0.16.11, pi-heif==0.21.0, unstructured_inference==0.8.1, unstructured.pytesseract==0.3.13, pytesseract==0.3.13`\n",
    "\n",
    "> **Note:** If you have been using `autogen` or `ag2`, all you need to do is upgrade it using:  \n",
    "> ```bash\n",
    "> pip install -U autogen[openai,neo4j]\n",
    "> ```\n",
    "> or  \n",
    "> ```bash\n",
    "> pip install -U ag2[openai,neo4j]\n",
    "> ```\n",
    "> as `autogen`, and `ag2` are aliases for the same PyPI package.  \n",
    "\n",
    ":::\n",
    "````\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set Configuration and OpenAI API Key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import autogen\n",
    "\n",
    "config_list = autogen.config_list_from_json(\n",
    "    \"OAI_CONFIG_LIST\",\n",
    "    filter_dict={\n",
    "        \"model\": [\"gpt-4.1\"],\n",
    "    },\n",
    ")\n",
    "os.environ[\"OPENAI_API_KEY\"] = config_list[0][\"api_key\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parse PDF file\n",
    "\n",
    "**Skip and use parsed files to run the rest.**\n",
    "This step is expensive and time consuming, please skip if you don't need to generate the full data set. The **estimated cost is from $10 to $15 to parse the pdf file and build the knowledge graph with entire parsed output**.\n",
    "\n",
    "For the notebook, we use a common financial document, [Nvidia 2024 10-K](https://investor.nvidia.com/financial-info/sec-filings/sec-filings-details/default.aspx?FilingId=17293267) as an example ([file download link](https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/1cbe8fe7-e08a-46e3-8dcc-b429fc06c1a4.pdf)).\n",
    "\n",
    "We use Unstructured-IO to parse the PDF, the table and image from the PDF are extracted out as .jpg files.\n",
    "\n",
    "All parsed output are saved in a JSON file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from unstructured.partition.pdf import partition_pdf\n",
    "from unstructured.staging.base import elements_to_json\n",
    "\n",
    "file_elements = partition_pdf(\n",
    "    filename=\"./input_files/nvidia_10k_2024.pdf\",\n",
    "    strategy=\"hi_res\",\n",
    "    languages=[\"eng\"],\n",
    "    infer_table_structure=True,\n",
    "    extract_images_in_pdf=True,\n",
    "    extract_image_block_output_dir=\"./parsed_pdf_info\",\n",
    "    extract_image_block_types=[\"Image\", \"Table\"],\n",
    "    extract_forms=False,\n",
    "    form_extraction_skip_tables=False,\n",
    ")\n",
    "\n",
    "elements_to_json(elements=file_elements, filename=\"parsed_elements.json\", encoding=\"utf-8\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create sample dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "output_elements = []\n",
    "keys_to_extract = [\"element_id\", \"text\", \"type\"]\n",
    "metadata_keys = [\"page_number\", \"parent_id\", \"image_path\"]\n",
    "text_types = {\"Text\", \"UncategorizedText\", \"NarrativeText\"}\n",
    "element_length = len(file_elements)\n",
    "for idx in range(element_length):\n",
    "    data = file_elements[idx].to_dict()\n",
    "    new_data = {key: data[key] for key in keys_to_extract}\n",
    "    metadata = data[\"metadata\"]\n",
    "    for key in metadata_keys:\n",
    "        if key in metadata:\n",
    "            new_data[key] = metadata[key]\n",
    "    if data[\"type\"] == \"Table\":\n",
    "        if idx > 0:\n",
    "            pre_data = file_elements[idx - 1].to_dict()\n",
    "            if pre_data[\"type\"] in text_types:\n",
    "                new_data[\"text\"] = pre_data[\"text\"] + new_data[\"text\"]\n",
    "        if idx < element_length - 1:\n",
    "            post_data = file_elements[idx + 1].to_dict()\n",
    "            if post_data[\"type\"] in text_types:\n",
    "                new_data[\"text\"] = new_data[\"text\"] + post_data[\"text\"]\n",
    "    output_elements.append(new_data)\n",
    "\n",
    "with open(\"proessed_elements.json\", \"w\", encoding=\"utf-8\") as file:\n",
    "    json.dump(output_elements, file, indent=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Imports\n",
    "\n",
    "**If you want to skip the parsing of the PDF file, you can start here.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is needed to allow nested asyncio calls for Neo4j in Jupyter\n",
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()\n",
    "\n",
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "from llama_index.llms.openai import OpenAI\n",
    "\n",
    "from autogen import AssistantAgent, ConversableAgent, UserProxyAgent\n",
    "\n",
    "# load documents\n",
    "from autogen.agentchat.contrib.graph_rag.document import Document, DocumentType\n",
    "from autogen.agentchat.contrib.graph_rag.neo4j_graph_query_engine import Neo4jGraphQueryEngine\n",
    "from autogen.agentchat.contrib.graph_rag.neo4j_graph_rag_capability import Neo4jGraphCapability\n",
    "from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a knowledge graph with sample data\n",
    "\n",
    "To save time and cost, we use a small subset of the data for the notebook.\n",
    "\n",
    "**This does not change the fact that the native RAG agent solution failed to provide the correct answer.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_path = \"./agentchat_pdf_rag/sample_elements.json\"\n",
    "input_documents = [\n",
    "    Document(doctype=DocumentType.JSON, path_or_url=input_path),\n",
    "]\n",
    "\n",
    "query_engine = Neo4jGraphQueryEngine(\n",
    "    username=\"neo4j\",  # Change if you reset username\n",
    "    password=\"password\",  # Change if you reset password\n",
    "    host=\"bolt://172.17.0.3\",  # Change\n",
    "    port=7687,  # if needed\n",
    "    llm=OpenAI(model=\"gpt-4.1\", temperature=0.0),  # Default, no need to specify\n",
    "    embedding=OpenAIEmbedding(model_name=\"text-embedding-3-small\"),  # except you want to use a different model\n",
    "    database=\"neo4j\",  # Change if you want to store the graphh in your custom database\n",
    ")\n",
    "\n",
    "# query_engine._clear()\n",
    "# Ingest data and create a new property graph\n",
    "query_engine.init_db(input_doc=input_documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Connect to knowledge graph if it is built"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query_engine = Neo4jGraphQueryEngine(\n",
    "    username=\"neo4j\",\n",
    "    password=\"password\",\n",
    "    host=\"bolt://172.17.0.3\",\n",
    "    port=7687,\n",
    "    database=\"neo4j\",\n",
    ")\n",
    "\n",
    "query_engine.connect_db()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Native RAG Agent Solution\n",
    "\n",
    "The following shows that when use a native RAG agent for parsed data, the agent failed to get the right information (5,282 instead of 4,430).\n",
    "\n",
    "Our best guess is that RAG agent fails to understand the table structure from text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rag_agent = ConversableAgent(\n",
    "    name=\"nvidia_rag\",\n",
    "    human_input_mode=\"NEVER\",\n",
    ")\n",
    "\n",
    "# Associate the capability with the agent\n",
    "graph_rag_capability = Neo4jGraphCapability(query_engine)\n",
    "graph_rag_capability.add_to_agent(rag_agent)\n",
    "\n",
    "# Create a user proxy agent to converse with our RAG agent\n",
    "user_proxy = UserProxyAgent(\n",
    "    name=\"user_proxy\",\n",
    "    human_input_mode=\"ALWAYS\",\n",
    ")\n",
    "\n",
    "user_proxy.initiate_chat(rag_agent, message=\"Could you list all tables from the document and its image_path?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Agentic RAG workflow for tabular data\n",
    "From the above example, when asked the goodwill asset (in millions) of the table NVIDIA Corporation and Subsidiaries Consolidated Balance Sheets, the answer was wrong.\n",
    "The correct figure from the table is $4,430 million instead of $4,400 million. \n",
    "To enhance the RAG performance from the tabular data, we introduce the enhanced workflow.\n",
    "\n",
    "The workflow consists a group of agent and use groupchat to coordinate. It breaks the RAG into 3 mains steps,\n",
    "1. it finds the parsed image of the corresponding table.\n",
    "2. it converts the image to table in structured Markdown format.\n",
    "3. With the table in Markdown, the workflow answer the question with the correct data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm_config = {\n",
    "    \"cache_seed\": 42,  # change the cache_seed for different trials\n",
    "    \"temperature\": 1,\n",
    "    \"config_list\": config_list,\n",
    "    \"timeout\": 120,\n",
    "}\n",
    "\n",
    "user_proxy = UserProxyAgent(\n",
    "    name=\"User_proxy\",\n",
    "    system_message=\"A human admin.\",\n",
    "    human_input_mode=\"ALWAYS\",  # Try between ALWAYS or NEVER\n",
    "    code_execution_config=False,\n",
    ")\n",
    "\n",
    "table_assistant = AssistantAgent(\n",
    "    name=\"table_assistant\",\n",
    "    system_message=\"\"\"You are a helpful assistant.\n",
    "    You will extract the table name from the message and reply with \"Find image_path for Table: {table_name}\".\n",
    "    For example, when you got message \"What is column data in table XYZ?\",\n",
    "    you will reply \"Find image_path for Table: XYZ\"\n",
    "    \"\"\",\n",
    "    llm_config=llm_config,\n",
    "    human_input_mode=\"NEVER\",  # Never ask for human input.\n",
    ")\n",
    "\n",
    "rag_agent = ConversableAgent(\n",
    "    name=\"nvidia_rag\",\n",
    "    human_input_mode=\"NEVER\",\n",
    ")\n",
    "\n",
    "# Associate the capability with the agent\n",
    "graph_rag_capability = Neo4jGraphCapability(query_engine)\n",
    "graph_rag_capability.add_to_agent(rag_agent)\n",
    "\n",
    "img_folder = \"/workspaces/ag2/notebook/agentchat_pdf_rag/parsed_pdf_info\"\n",
    "\n",
    "img_request_format = ConversableAgent(\n",
    "    name=\"img_request_format\",\n",
    "    system_message=f\"\"\"You are a helpful assistant.\n",
    "    You will extract the table_file_name from the message and reply with \"Please extract table from the following image and convert it to Markdown.\n",
    "    <img {img_folder}/table_file_name>.\".\n",
    "    For example, when you got message \"The image path for the table titled XYZ is \"./parsed_pdf_info/abcde\".\",\n",
    "    you will reply \"Please extract table from the following image and convert it to Markdown.\n",
    "    <img {img_folder}/abcde>.\"\n",
    "    \"\"\",\n",
    "    llm_config=llm_config,\n",
    "    human_input_mode=\"NEVER\",\n",
    ")\n",
    "\n",
    "image2table_convertor = MultimodalConversableAgent(\n",
    "    name=\"image2table_convertor\",\n",
    "    system_message=\"\"\"\n",
    "    You are an image to table converter. You will process an image of one or multiple consecutive tables.\n",
    "    You need to follow the following steps in sequence,\n",
    "    1. extract the complete table contents and structure.\n",
    "    2. Make sure the structure is complete and no information is left out. Otherwise, start from step 1 again.\n",
    "    3. Correct typos in the text fields.\n",
    "    4. In the end, output the table(s) in Markdown.\n",
    "    \"\"\",\n",
    "    llm_config={\"config_list\": config_list, \"max_tokens\": 300},\n",
    "    human_input_mode=\"NEVER\",\n",
    "    max_consecutive_auto_reply=1,\n",
    ")\n",
    "\n",
    "conclusion = AssistantAgent(\n",
    "    name=\"conclusion\",\n",
    "    system_message=\"\"\"You are a helpful assistant.\n",
    "    Base on the history of the groupchat, answer the original question from User_proxy.\n",
    "    \"\"\",\n",
    "    llm_config=llm_config,\n",
    "    human_input_mode=\"NEVER\",  # Never ask for human input.\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "groupchat = autogen.GroupChat(\n",
    "    agents=[\n",
    "        user_proxy,\n",
    "        table_assistant,\n",
    "        rag_agent,\n",
    "        img_request_format,\n",
    "        image2table_convertor,\n",
    "        conclusion,\n",
    "    ],\n",
    "    messages=[],\n",
    "    speaker_selection_method=\"round_robin\",\n",
    ")\n",
    "manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)\n",
    "chat_result = user_proxy.initiate_chat(\n",
    "    manager,\n",
    "    message=\"What is goodwill asset (in millions) for 2024 in table NVIDIA Corporation and Subsidiaries Consolidated Balance Sheets?\",\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "front_matter": {
   "description": "Agentic RAG workflow on tabular data from a PDF file",
   "tags": [
    "RAG",
    "groupchat"
   ]
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
